Although a wide range of risk factors for coronary heart disease
have been identified from population studies, these measures,
singly or in combination, are insufficiently powerful to provide
a reliable, noninvasive diagnosis of the presence of coronary
heart disease. Here we show that pattern-recognition techniques
applied to proton nuclear magnetic resonance (¹H-NMR) spectra of
human serum can correctly diagnose not only the presence, but
also the severity, of coronary heart disease. Application of
supervised partial least squares-discriminant analysis to
orthogonal signal-corrected data sets allows >90% of subjects
with stenosis of all three major coronary vessels to be
distinguished from subjects with angiographically normal coronary
arteries, with a specificity of >90%. Our studies show for the
first time a technique capable of providing an accurate,
noninvasive and rapid diagnosis of coronary heart disease that
can be used clinically, either in population screening or to
allow effective targeting of treatments such as statins.

Coronary heart disease (CHD) is a major cause of mortality 
and morbidity in developed countries, affecting as many as 
one in three individuals before the age of 70 years 1 . Over the
past three decades a range of environmental and biochemi- 
cal risk factors for the development of CHD have been iden- 
tified in cross-sectional studies 2 . For example, tobacco 
smoking is associated with an approximately two-fold in- 
creased risk of CHD 3 . Similarly, high levels of cholesterol in 
large, triglyceride-rich lipoprotein particles (mainly very 
low-density lipoprotein (VLDL) and low-density lipoprotein 
(LDL)) and lower levels of cholesterol in high-density 
lipoprotein (HDL) particles are known to be associated with 
increased risk of CHD 4 . 

These epidemiological studies have been very useful in 
several ways. First, they have underpinned public health 
policy on a range of issues, discouraging tobacco smoking 
and promoting a low-cholesterol diet 5 . Second, they have 
provided vital clues as to the underlying molecular mecha- 
nisms that cause atherosclerosis and CHD 6 . However, the 
risk factors identified so far from cross-sectional epidemiological
studies are insufficiently powerful to provide a clinically useful
diagnosis of CHD. Although algorithms have
been designed based on a range of risk factors, such as age, 
sex, lipoprotein levels and blood pressure, which can iden- 
tify subpopulations at very significant excess risk of CHD, 
even the best of these based on the excellent Prospective 
Cardiovascular Münster (PROCAM) study in Münster,
Germany, cannot diagnose the presence of CHD on an indi- 
vidual-by-individual basis 7,8 . 

Recently, however, there have been technical advances 
that have allowed extremely high-density data sets to be 
constructed from individuals. Techniques such as genomics, 
proteomics and metabonomics (a systems approach to ex- 
amining the changes in hundreds or thousands of low-mol- 
ecular-weight metabolites in an intact tissue or biofluid 9 ) 
offer the prospect of efficiently distinguishing individuals 
with particular disease or toxic states. Of these techniques, 
NMR-based metabonomics offers several distinct advantages 
in a clinical setting. First, it can be carried out on standard 
preparations of serum, plasma or urine 10,11 , circumventing
the need for specialist preparations of cellular RNA and protein
required for genomics and proteomics, respectively 12-14 .
Second, many of the risk factors already identified (such as 
levels of various lipids) are small-molecule metabolites that 
will contribute to the metabonomic data set. 

In this study we have applied recently developed pattern- 
recognition techniques to NMR spectra of either serum or 
plasma taken from individuals who have been extensively 
characterized, both for the presence of CHD by the gold-stan- 
dard angiographic technique and for a wide range of conven- 
tional risk factors. This allows direct comparison of the 
performance of the metabonomic analysis as a diagnostic 
technique with algorithms based on conventional risk factors. 

Fig. 1 Comparison of patients with severe atherosclerosis (TVD) and
patients with normal coronary arteries (NCA). The 600-MHz ¹H-NMR
spectra of serum samples from a typical NCA patient (a) and a TVD
patient (b) are shown. The chemical shifts of a selection of major
metabolites are indicated (based on assignments from 2D-TOCSY
spectra), although these metabolites do not contribute all (or in
some cases, even most) of the signal at the indicated chemical
shift. c, Data-reduced ¹H-NMR spectrum of a serum sample from a
typical TVD patient. The region between δ4.5 and δ6 has been deleted
to reduce the likelihood of any variance contribution from
incomplete suppression of the water signal. d, PLS-DA scores plot
showing the considerable separation achieved between NCA (▲) and
TVD (■) samples. Note that optimum separation occurred in the second
and third principal components (t[2] and t[3]). e, The regression
coefficients of the PLS-DA model shown in (d). Positive coefficients
indicate relatively higher values for that spectral region in the
TVD samples compared with the NCA samples, whereas negative
coefficients indicate lower values. The magnitude of the coefficient
represents the relative importance of each data bin on the
separation achieved in (d). f, PLS-DA scores plot after application
of the OSC data filter to remove uncorrelated variance components.
Note the considerable improvement in separation achieved (compared
with d above), which now occurs in the first two principal
components (t[1] and t[2]). g, The regression coefficients for the
PLS-DA model using the OSC-transformed data set.

The 600-MHz ¹H-NMR spectra of human sera from patients with severe
CHD (triple vessel disease (TVD) patients; n = 36) and patients with
angiographically normal coronary arteries (NCA patients; n = 30)
were compared visually (Fig. 1a and b). The clinical characteristics
of the populations studied here are provided in the Supplemental
Note online. Few systematic differences were detected when the two
groups were compared visually. Chemical components were assigned to
the spectra on the basis of previously published data 15,16 . To
reduce the complexity of the NMR data to facilitate pattern
recognition, the spectra were automatically data-reduced to 245
integral segments, each comprising 0.04 p.p.m., before chemometric
analysis (Fig. 1c).
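
The data-reduction step described above can be illustrated with a short sketch. It is not the software used in the study: the function name `bin_spectrum`, the δ0.2 to δ10.0 window (chosen only because 9.8 p.p.m. divided by 0.04 p.p.m. gives 245 segments) and the point at which the water region is dropped are all assumptions made for the example.

```python
def bin_spectrum(ppm, intensity, lo=0.2, hi=10.0, width=0.04,
                 exclude=(4.5, 6.0)):
    """Sum spectral intensities into fixed-width chemical-shift segments.

    Returns a list of (segment_centre_ppm, integral) pairs. Segments whose
    centre lies inside the `exclude` window (assumed water region) are
    dropped. The window limits are illustrative assumptions.
    """
    n_bins = round((hi - lo) / width)          # 245 with the defaults
    integrals = [0.0] * n_bins
    for shift, value in zip(ppm, intensity):
        if lo <= shift < hi:
            idx = min(int((shift - lo) / width), n_bins - 1)
            integrals[idx] += value
    kept = []
    for i, area in enumerate(integrals):
        centre = round(lo + (i + 0.5) * width, 4)
        if not (exclude[0] <= centre < exclude[1]):
            kept.append((centre, area))
    return kept


# Hypothetical four-point "spectrum": two points fall in the same lipid
# segment near 1.30 p.p.m., one near 3.22 p.p.m. and one in the water region.
example = bin_spectrum([1.30, 1.31, 3.22, 5.01], [2.0, 3.0, 1.0, 7.0])
```

Each retained segment then becomes one input variable for the chemometric analysis, so every sample is described by the same fixed-length vector regardless of small differences in peak position within a segment.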

To determine whether it was possible to distinguish TVD and NCA
patients on the basis of the NMR spectra, we carried out principal
components analysis 17 (PCA) and partial least squares-discriminant
analysis 17 (PLS-DA). The PLS-DA scores plot of the second and third
principal components (PC2 and PC3) shows that, although there was
overlap between the two sample classes, some clustering was evident
(Fig. 1d). The regions of the NMR spectrum that most strongly
influence separation between NCA and TVD samples are indicated by
the regression coefficients (Fig. 1e). The coefficients were derived
from the PLS-DA model and each bar represents a spectral region
covering 0.04 p.p.m., showing how the ¹H-NMR profile of the TVD
samples differed from the ¹H-NMR profile of the NCA serum samples. A
positive value indicated there was a relatively greater
concentration of metabolite (assigned using NMR chemical shift
assignment tables) present in TVD samples and a negative value
indicated a relatively lower concentration, with respect to NCA
samples. In general, the regression coefficients, or loadings, most
influential for the TVD samples lie around δ0.86 (due mainly to CH₃
groups from fatty acid side chains in lipids, in particular LDL and
VLDL) and δ1.26, δ1.3 and δ1.34 (due mainly to (CH₂)ₙ groups from
fatty acid side chains in lipids, in particular VLDL and LDL). The
loadings most influential for the NCA samples lie around δ1.22 (due
mainly to (CH₂)ₙ groups from fatty acid side chains in lipids, in
particular HDL) and δ3.22 (due to choline -N(CH₃)₃⁺). The region at
δ3.22 is assigned to -N(CH₃)₃⁺ groups in molecules containing the
choline moiety, principally phosphatidylcholine from lipoproteins,
mainly HDL, based on the known phospholipid content of lipoproteins.
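
How a regression coefficient per spectral bin arises, and why its sign indicates the class in which that bin is higher, can be seen in a deliberately reduced sketch. It fits a single PLS component only (the models reported here use more), and the function names and the three-bin toy data are hypothetical, not taken from the study.

```python
def pls_da_one_component(X, y):
    """Fit a one-component PLS model of 0/1 class labels y on the rows of X.

    Returns (x_mean, y_mean, coef), where coef holds one regression
    coefficient per input variable (spectral bin).
    """
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(m)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: covariance of each bin with the class label.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(m)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    # Scores: projection of each sample onto the weight vector.
    t = [sum(Xc[i][j] * w[j] for j in range(m)) for i in range(n)]
    # Regress the class label on the scores.
    q = sum(t[i] * yc[i] for i in range(n)) / sum(v * v for v in t)
    coef = [q * v for v in w]
    return x_mean, y_mean, coef


def predict(model, row):
    """Predicted Y value for one sample; values near 1 suggest class 1."""
    x_mean, y_mean, coef = model
    return y_mean + sum(c * (x - mu) for c, x, mu in zip(coef, row, x_mean))


# Toy data (hypothetical): bin 0 is higher in class 1, bin 1 is lower in
# class 1, bin 2 does not vary. Class 1 stands for TVD, class 0 for NCA.
X = [[1.0, 3.0, 2.0], [1.2, 2.8, 2.0], [2.0, 2.0, 2.0], [2.2, 1.8, 2.0]]
y = [0, 0, 1, 1]
model = pls_da_one_component(X, y)
```

In this toy case the coefficient for bin 0 is positive, that for bin 1 is negative and that for the invariant bin is zero, which mirrors the way the bars in a regression coefficient plot are read.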

If chemometric analysis is suggestive of separation be- 
tween the classes under investigation, orthogonal signal 
correction (OSC) can be used to optimize the separation 18 , 
thus improving the performance of subsequent multivariate 
pattern-recognition analysis and enhancing the predictive 
power of the model. After application of OSC, the TVD and 
NCA groups were well separated in the PLS-DA scores plot of 
PC1 and PC2 (Fig. 1f). The regression coefficients (Fig. 1g)
indicated that the same regions of the spectra that con- 
tributed to the clustering in the unfiltered data set also con- 
tributed to the clustering seen after application of OSC. The 
statistically significant loadings are presented in the 
Supplemental Note online. 
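
The idea behind OSC, removing variation in the spectra that is unrelated to class membership, can be sketched in a single pass. This is a simplification made for illustration and not the full iterative algorithm of ref. 18: it takes the first principal-component score, makes it orthogonal to the class vector and subtracts that component from the data. The function names are hypothetical, and the input is assumed to contain two classes.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def osc_one_component(X, y, n_iter=100):
    """Remove one component of X that is orthogonal to the class vector y.

    X is a list of sample rows, y a list of 0/1 labels (both classes
    present). Both are mean-centred; the filtered, centred matrix is
    returned.
    """
    n, m = len(X), len(X[0])
    col_mean = [sum(row[j] for row in X) / n for j in range(m)]
    Xc = [[row[j] - col_mean[j] for j in range(m)] for row in X]
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]

    # First principal-component direction by power iteration.
    v = [1.0] * m
    for _ in range(n_iter):
        t = [dot(row, v) for row in Xc]
        v = [sum(Xc[i][j] * t[i] for i in range(n)) for j in range(m)]
        norm = dot(v, v) ** 0.5
        if norm == 0.0:
            return Xc
        v = [a / norm for a in v]
    t = [dot(row, v) for row in Xc]

    # Make the score vector orthogonal to the class vector.
    scale = dot(yc, t) / dot(yc, yc)
    t_orth = [ti - scale * yi for ti, yi in zip(t, yc)]
    tt = dot(t_orth, t_orth)
    if tt < 1e-12:
        return Xc  # nothing class-independent to remove
    # Loading of the orthogonal component, then deflation of X.
    p = [sum(Xc[i][j] * t_orth[i] for i in range(n)) / tt for j in range(m)]
    return [[Xc[i][j] - t_orth[i] * p[j] for j in range(m)] for i in range(n)]


# Toy data (hypothetical): column 0 tracks the class, column 1 is a large
# source of variation that is unrelated to the class.
X = [[-1.0, 3.0], [-1.0, -3.0], [1.0, 3.0], [1.0, -3.0]]
y = [0, 0, 1, 1]
X_filtered = osc_one_component(X, y)
```

Because the removed score vector is orthogonal to the class vector, the covariance between each variable and the class label is unchanged by the filter; only class-independent variance is taken out, which is why the separation moves into the first components of the subsequent PLS-DA model.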
Approximately 80% of the samples (the 'training set') were then
selected at random to construct a PLS-DA model that could then be
used to predict the class membership of the remaining 20% of samples
(the 'test set'). The regression coefficients for the training set
again indicated that the same spectral regions contributed most
strongly to the discrimination of the classes: lipids, mostly VLDL,
LDL and HDL, and choline. The PLS-DA model calculated from
OSC-filtered ¹H-NMR data for the training set predicted the presence
of CHD with a sensitivity of 92% and a specificity of 93% based on a
99% confidence limit for class membership (Fig. 2). The Y-predicted
scatter plot assigned samples to either class 1 (TVD) or class 0
(NCA) using an a priori cut-off of 0.5, and showed the ability of
¹H-NMR-based metabonomics to predict class membership (NCA or TVD)
of unknown samples.
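
The class assignment and the two performance measures quoted above follow directly from the Y-predicted values. The sketch below assumes class 1 is TVD and class 0 is NCA, as in the text; the function names and the predicted values are invented for illustration and are not the study's data.

```python
def classify(y_pred, cutoff=0.5):
    """Assign class 1 (TVD) at or above the cut-off, class 0 (NCA) below."""
    return [1 if value >= cutoff else 0 for value in y_pred]


def sensitivity_specificity(y_true, y_hat):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).

    Assumes both classes occur in y_true.
    """
    tp = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_hat) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_hat) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical test set of six TVD and six NCA samples.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [0.9, 0.8, 1.1, 0.7, 0.4, 0.95, 0.1, 0.2, -0.1, 0.3, 0.05, 0.45]
sens, spec = sensitivity_specificity(y_true, classify(y_pred))
```

With these invented values one TVD sample falls below the cut-off, so five of six TVD samples and all six NCA samples are assigned correctly.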

As the cohort of patients under investigation was selected 
only on the basis of coronary artery disease status, the pro- 
portion of men and women in the two groups and their age 
distributions are different (see Supplemental Note online). 
Such gender bias is inevitable if we are to avoid overemphasizing
the power of our diagnostic assay by artifactually removing
major sources of variation, as commonly occurs
when using highly age- and sex-matched groups. However, 
the classification of the patients as NCA or TVD on the basis 
of the NMR spectrum did not depend on the sex bias of the 
patient groups: most male NCA subjects were correctly classi- 
fied as NCA, despite the fact that the NCA group was primar- 
ily composed of women (Fisher's exact test; P = 0.006). In 
contrast, if gender rather than artery status is used to catego- 
rize the samples, PLS-DA is able to classify the individuals on 
the basis of sex with 100% sensitivity and specificity (data not 
shown). This demonstrates that our metabonomic analysis is 
able to diagnose the presence of CHD against the background 
variation in the gender distribution of the groups, but is not 
relying on gender differences to make the diagnosis. 
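
Fisher's exact test on a 2 × 2 table can be written out in a few lines. The underlying table of counts is not given in the text, so the sketch below uses a textbook example rather than attempting to reproduce P = 0.006; the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


# Textbook 2 x 2 table (not study data): rows could be, for example,
# sex and columns the predicted class.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

A small P value indicates that the row and column classifications are associated more strongly than chance allocation with the same margins would produce.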

This study demonstrated that ¹H-NMR-based metabonomic analysis of
serum samples, in itself minimally invasive and non-destructive of
sample, can achieve a clinically useful diagnostic performance,
when compared with invasive angiography.

To determine whether ¹H-NMR-based metabonomic analysis could
distinguish the severity of CHD present, we collected a set of
samples from individuals with stenosis of one (mild, n = 28), two
(moderate, n = 20) or three (severe, n = 28) major coronary
arteries. Although this is only a crude indicator of disease
severity, it is plausible that the number of stenosed vessels
correlated (at least weakly) with atherosclerotic plaque load in the
whole body. The 600-MHz ¹H-NMR spectra from the 76 patients with CHD
of varying severity were obtained and analyzed by PCA and PLS-DA.
Following application of OSC to ¹H-NMR data, separation between the
mild, moderate and severe CHD samples was evident and the regression
coefficients indicated that once again the lipid parameters
contributed most strongly to the separation (Figs. 3a and b). To
optimally view the separation between the mild, moderate and severe
CHD samples, PLS-DA models and regression coefficients were
calculated for mild and moderate CHD (Figs. 3c and d), moderate and
severe CHD (Figs. 3e and f), and mild and severe CHD (Figs. 3g and
h). For the models shown in Figs. 3a, c and g, the second principal
component was not statistically significant in PLS-DA, but is shown
for clarity.



The mild, moderate and severe CHD samples were also 
compared on the basis of established clinical risk factors 
(tabulated in full in the Supplemental Note online). None of 
the risk factors measured (including age, blood pressure, 
LDL and HDL cholesterol, total cholesterol, total triglyc- 
eride, fibrinogen, plasminogen activator inhibitor (PAI-1), 
white blood cell count, creatinine or history of cigarette 
smoking) was significantly different between the three 
groups (P > 0.05 by ANOVA in each case; Supplemental Note 
online). Furthermore, chemometric analysis of these clinical 
data was not able to determine the number of stenosed ves- 
sels. In the PLS-DA model of the clinical data, none of the 
principal components extracted was statistically significant, 
in contrast to the PLS-DA model based on the NMR data 
shown in Fig. 3. We conclude that ¹H-NMR-based metabonomics
is better able to distinguish the severity of CHD
based on a single blood sample than any of the conven- 
tional risk factors yet identified, even when pattern-recogni- 
tion methodology is applied. 
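
The one-way ANOVA used to compare each risk factor across the three severity groups reduces to a ratio of between-group to within-group variance. The sketch below computes only the F statistic (the P value requires the F distribution, which is outside the Python standard library); the function name and the numbers are illustrative and are not the clinical data.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over a list of groups of values."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (mu - grand_mean) ** 2
                     for g, mu in zip(groups, means))
    ss_within = sum((x - mu) ** 2
                    for g, mu in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical values of one risk factor in mild, moderate and severe groups.
f_stat = one_way_anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]])
```

An F statistic close to 1 means the group means differ no more than expected from the scatter within groups, which is the pattern reported here for the conventional risk factors.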

Discussion 

We have demonstrated that it is possible to completely separate CHD
patients with stenosis of all three major arteries from subjects
with normal coronary arteries using both unsupervised PCA and
supervised PLS-DA applied to ¹H-NMR spectra of human serum.
Furthermore, using the supervised PLS-DA algorithm, it is possible
to predict the artery status of unknown samples using a training set
composed of only 24 individuals with NCA and 30 individuals with
TVD. The small size of the training set required to achieve greater
than 90% sensitivity and specificity highlights the power of this
technique. Substantially larger training sets obtained through
application of this technique to clinical practice should further
improve the diagnostic sensitivity and specificity of the technique.
Both PCA and PLS-DA analyses were improved by prior application of
the OSC technique to the data set 18 . On this basis, OSC is likely
to find widespread application in pattern recognition for
high-information-density data sets, not only in metabonomics, but
also in genomics and proteomics.

Fig. 2 Prediction of coronary artery status using the PLS-DA model.
A PLS-DA model was constructed using the OSC-filtered data from 24
NCA patients (▲) and 30 TVD patients (■) (the 'training set'). This
model was then used to 'predict' the coronary artery status of a
further six samples of each class that were not used in the
construction of the model (the 'test set'). Predictions are made
using a Y-predicted scatter plot with the a priori cut-off of 0.5
for class membership. The test sets are shown as black diamonds,
with their angiographically determined artery status shown in text.

Both of the pattern-recognition algorithms used here rely on
extraction of linear associations between the input variables, which
can significantly limit the power of the analysis (the strengths and
weaknesses of our study design are discussed in detail in the
Supplemental Note online). It is already clear that
neural-network-based pattern-recognition techniques can considerably
improve the ability to classify individuals on the basis of many
interrelated input variables 19 , particularly when membership in a
class (such as having CHD) may result from one of a range of
unrelated causes. Nevertheless, the methods we applied are
sufficiently powerful to allow classification of the individuals we
studied, and provide one additional benefit over neural network
methods: they allow information to be more easily gained as to what
aspects of the input data set were particularly important in
allowing the classification to be made. The regression coefficients
(Figs. 1e and g) indicated an important contribution from two of the
data regions in particular: the bins around chemical shifts of δ1.30
and δ3.22. Although the peaks around δ1.30 are known to result from
lipid CH₂ resonances and to correlate with the levels of
LDL-cholesterol (r = 0.45; P < 0.01), it is notable that only 20-30%
of the variance in this bin is related to classical measurement of
LDL-cholesterol concentration. The remaining variance is likely to
result from subtle chemical differences in the lipid composition of
LDL particles between individuals, for example, the degree of
fatty-acid side-chain unsaturation and lipoprotein-protein molecular
interactions. The assignment of a metabolite to a particular
chemical shift only indicates that the concentration of that
metabolite contributes to the variance seen at that chemical shift.
However, that contribution may only be a small part of the total
variance, particularly at low chemical shifts where many different
molecular species contribute to each spectral interval. The
importance of bins with a variance component due to lipoprotein
composition will likely contribute to ongoing studies using both NMR
and other analytical techniques to understand the contribution of
lipoprotein particle composition to the development of CHD 20 . It
does, however, emphasize an important facet of high-data-density
metabolic analysis in that it is entirely unnecessary to understand
fully the complex molecular differences that underlie the spectral
features associated with CHD to be able to correctly classify
individuals with very high sensitivity and specificity. Further
analysis of the molecular basis of the spectral differences,
however, will give insight into the mechanistic processes involved.

Whereas currently a firm diagnosis of CHD can only be made through
application of angiography, which is both expensive and invasive,
the introduction of metabonomic screening would allow diagnosis to
be made simply and cheaply on the basis of a single blood sample. As
angiographic status was used as the classification variable during
model construction, the diagnostic assay reported here behaves as a
simple replacement for angiography. Like angiography, it does not
distinguish stable and unstable angina, which may be a major factor
in determining the likelihood of future myocardial infarction.
Nevertheless, the availability of a relatively cheap and noninvasive
replacement for angiography would revolutionize the provision of
health care for CHD, allowing both widespread population screening
and more efficient targeting of drugs, such as statins. These drugs,
although broadly effective in reducing the risk of myocardial
infarction, are difficult to target to those most in need of
treatment.

Fig. 3 Comparison of patients with different severity of coronary
atherosclerosis. Spectra were generated for a further 76 males all
of whom had angiographically proven coronary artery disease. These
individuals were classified as having mild (n = 28), moderate
(n = 20) or severe (n = 28) disease according to whether they had
stenosis (>50% blockage) of one, two or three of the major coronary
arteries. PLS-DA models were then generated on the OSC-transformed
dataset exactly as for Fig. 1. (a) PLS-DA model comparing all three
severity groups, mild (▲), moderate (•) or severe (■). Individuals
with mild disease are well separated from the more severe groups in
the first principal component. (b) Regression coefficient plot for
the PLS-DA model shown in (a). (c and d) PLS-DA model scores plot
and regression coefficient plot, respectively, produced using only
the mild and moderate groups. (e and f) PLS-DA model scores plot and
regression coefficient plot, respectively, produced using only the
moderate and severe groups. (g and h) PLS-DA model scores plot and
regression coefficient plot, respectively, produced using only the
mild and severe groups. In each case, positive regression
coefficients indicate higher signal in the more severe group. Note
that the second principal component (t[2]) is not statistically
significant for the models shown in (a), (c) and (g).

Methods

[...]phy. The inclusion and exclusion criteria we applied are
de[...]

[...] plastic tubes for 2 h at room temperature, and the serum was
collected by centrifugation. Aliquots of serum were stored at [...]

[...] Diatube H tubes, and platelet-poor plasma was prepared [...]

[...] as a response vector, Y, to describe the variation between the
sample classes. The OSC method 18 then locates the longest vector
describing [...] that is uncorrelated with the Y-vector and removes
it from the data matrix [...]

[...] a Bruker [...] Massachusetts) op[...] irradiated and t1
corresponds to a fixed [...] during the mixing time ([...] ms). For
each sample, [...] were collected into [...] points using a spectral
width of 8389 Hz and an acquisition time of [...] multiplied by an
exponential weighting function corresponding to a line broadening of
[...] Hz before Fourier [...] acquired NMR spectra were corrected
for phase and baseline distortions using XWINNMR (version 2.1,
Bruker) and referenced to lactate (CH₃, δ1.33).

[...] (p), which highlight the influence of input variables [...]
explain maximum [...] matrix (Y), which describes variation
according to class [...] loadings (P), X- and Y-weights (W, Q), and
PLS regression coefficients (B). Once a PLS-DA model is calculated
and validated it can be used for prediction of class membership for
unknown samples.

NATURE MEDICINE • VOLUME 8 • NUMBER 12 • DECEMBER 2002
NEW TECHNOLOGY